library(galah)
library(dplyr)
library(here)
library(ggplot2)
galah_config(email = "your-email-here")
pardalotes <- galah_call() |>
identify("Pardalotus") |>
filter(year == 2015) |>
select(genus,
species,
scientificName,
cl22,
year,
month,
decimalLatitude,
decimalLongitude) |>
atlas_occurrences()
pardalotes_taxonomy <- galah_call() |>
identify("Pardalotus") |>
atlas_species()Appendix D — Joins
If you work with biodiversity data, odds are high that you will need to join 2 separate datasets at some point to test how spatial, temporal or environmental factors influence species. This chapter is a brief overview of several common types of joins dplyr to help get started. For a comprehensive introduction to joins, check out the Joins chapter in R for Data Science.
D.0.1 Prerequisites
In this chapter, we will use pardalote occurrence data from 2015 in the ALA.
D.1 Keys
Joining dataframes relies on setting a key—one or more columns that exist in a primary table that correspond to one or more columns in a second table. Two dataframes that we intend to join are matched according to the designated key.
As a basic example, let’s say we want to add complete taxonomic information to our pardalotes dataframe, which contains occurrence records with some, but not all, taxonomic information. pardalotes_taxonomy contains complete taxonomic information for Pardalotus.
pardalotes# A tibble: 36,132 × 9
genus species scientificName cl22 year month decimalLatitude
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Pardalotus Pardalotus punct… Pardalotus (P… Vict… 2015 10 -38.1
2 Pardalotus Pardalotus stria… Pardalotus (P… Vict… 2015 5 -38.1
3 Pardalotus Pardalotus punct… Pardalotus (P… Vict… 2015 4 -38.2
4 Pardalotus Pardalotus stria… Pardalotus (P… Aust… 2015 8 -35.2
5 Pardalotus Pardalotus stria… Pardalotus (P… Sout… 2015 10 -33.7
6 Pardalotus Pardalotus punct… Pardalotus (P… Aust… 2015 1 -35.2
7 Pardalotus Pardalotus stria… Pardalotus (P… New … 2015 11 -34.9
8 Pardalotus Pardalotus punct… Pardalotus (P… Vict… 2015 10 -38.2
9 Pardalotus Pardalotus stria… Pardalotus (P… Quee… 2015 6 -19.2
10 Pardalotus Pardalotus punct… Pardalotus (P… Vict… 2015 12 -37.6
# ℹ 36,122 more rows
# ℹ 2 more variables: decimalLongitude <dbl>, cl959 <chr>
pardalotes_taxonomy# A tibble: 4 × 11
taxon_concept_id species_name scientific_name_auth…¹ taxon_rank kingdom phylum
<chr> <chr> <chr> <chr> <chr> <chr>
1 https://biodive… Pardalotus … (Gmelin, 1789) species Animal… Chord…
2 https://biodive… Pardalotus … (Shaw, 1792) species Animal… Chord…
3 https://biodive… Pardalotus … Gould, 1838 species Animal… Chord…
4 https://biodive… Pardalotus … Gould, 1838 species Animal… Chord…
# ℹ abbreviated name: ¹scientific_name_authorship
# ℹ 5 more variables: class <chr>, order <chr>, family <chr>, genus <chr>,
# vernacular_name <chr>
Let’s join our pardalotes dataframe with pardalotes_taxonomy. The column scientificName in pardalotes appears to contain (in general) the same information in column species_name in pardalotes_taxonomy.
pardalotes |>
select(scientificName)# A tibble: 36,132 × 1
scientificName
<chr>
1 Pardalotus (Pardalotus) punctatus
2 Pardalotus (Pardalotinus) striatus
3 Pardalotus (Pardalotus) punctatus
4 Pardalotus (Pardalotinus) striatus
5 Pardalotus (Pardalotinus) striatus
6 Pardalotus (Pardalotus) punctatus
7 Pardalotus (Pardalotinus) striatus
8 Pardalotus (Pardalotus) punctatus
9 Pardalotus (Pardalotinus) striatus
10 Pardalotus (Pardalotus) punctatus
# ℹ 36,122 more rows
pardalotes_taxonomy |>
select(species_name)# A tibble: 4 × 1
species_name
<chr>
1 Pardalotus (Pardalotinus) striatus
2 Pardalotus (Pardalotus) punctatus
3 Pardalotus (Pardalotinus) rubricatus
4 Pardalotus (Pardalotus) quadragintus
We can use these columns as a key to join the additional taxonomic information to our pardalotes occurrence records. We can also use the genus column as a second key to match any observations that have been identified down to genus, but not species, level.
pardalotes |>
left_join(pardalotes_taxonomy,
join_by(scientificName == species_name,
genus == genus)
) |>
rmarkdown::paged_table() # paged outputD.2 Basic types of joins
There are many types of joins that can help you in all kinds of situations! Join types generally fall within two categories:
- Mutating joins combine variables from two tables (e.g.,
left_join(),right_join(),full_join()) - Filtering joins combine variables, and additionally keep or remove rows that do not match the key column (e.g.,
semi_join(),anti_join())
Below are a few common examples of join types. Examples and animations are taken from Garrick Aden-Buie’s tidyexplain animations.
x <- tibble(id = c(1, 2, 3),
x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4),
y = c("y1", "y2", "y4"))
left_join(x, y, join_by(id == id))# A tibble: 3 × 3
id x y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>

right_join(x, y, join_by(id == id))# A tibble: 3 × 3
id x y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y4

full_join(x, y, join_by(id == id))# A tibble: 4 × 3
id x y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
4 4 <NA> y4

semi_join(x, y, join_by(id == id))# A tibble: 2 × 2
id x
<dbl> <chr>
1 1 x1
2 2 x2

anti_join(x, y, join_by(id == id))# A tibble: 1 × 2
id x
<dbl> <chr>
1 3 x3
D.3 Spatial joins
It can be useful to summarise where species or taxonomic groups occur by grouping by a spatial region (e.g., state, council area, bioregion). To do so, records or summary statistics need to be linked to the region they pertain to. Usually, this requires a join between a spatial object and a dataframe, or between two spatial objects.
As an simple example, let’s download a shapefile of Australian states and territories using the ozmaps package. Our aus object contains the name of each state/territory (NAME) and its shape (geometry)1.
library(sf)
library(ozmaps)
aus <- ozmap_states |>
st_transform(4326)
aus- 1
- This line transforms the CRS projection of our map to match the CRS projection of ALA data.
Simple feature collection with 9 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 105.5507 ymin: -43.63203 xmax: 167.9969 ymax: -9.229287
Geodetic CRS: WGS 84
# A tibble: 9 × 2
NAME geometry
* <chr> <MULTIPOLYGON [°]>
1 New South Wales (((150.7016 -35.12286, 150.6611 -35.11782, 150.6…
2 Victoria (((146.6196 -38.70196, 146.6721 -38.70259, 146.6…
3 Queensland (((148.8473 -20.3457, 148.8722 -20.37575, 148.85…
4 South Australia (((137.3481 -34.48242, 137.3749 -34.46885, 137.3…
5 Western Australia (((126.3868 -14.01168, 126.3625 -13.98264, 126.3…
6 Tasmania (((147.8397 -40.29844, 147.8902 -40.30258, 147.8…
7 Northern Territory (((136.3669 -13.84237, 136.3339 -13.83922, 136.3…
8 Australian Capital Territory (((149.2317 -35.222, 149.2346 -35.24047, 149.271…
9 Other Territories (((167.9333 -29.05421, 167.9188 -29.0344, 167.93…
Our pardalotes data also contains the state/territory of each occurrence in column cl22. We can group by state/territory and summarise the number of occurrences to get an overall count by state/territory.
counts_by_state <- pardalotes |>
group_by(cl22) |>
count()
counts_by_state# A tibble: 9 × 2
# Groups: cl22 [9]
cl22 n
<chr> <int>
1 Australian Capital Territory 3982
2 New South Wales 7652
3 Northern Territory 549
4 Queensland 6687
5 South Australia 1896
6 Tasmania 1561
7 Victoria 11123
8 Western Australia 2522
9 <NA> 160
To prepare our data for mapping, we can join counts_by_state to aus using the state/territory name as our key.
aus_counts <- aus |>
left_join(counts_by_state,
join_by(NAME == cl22))
aus_countsSimple feature collection with 9 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 105.5507 ymin: -43.63203 xmax: 167.9969 ymax: -9.229287
Geodetic CRS: WGS 84
# A tibble: 9 × 3
NAME geometry n
<chr> <MULTIPOLYGON [°]> <int>
1 New South Wales (((150.7016 -35.12286, 150.6611 -35.11782,… 7652
2 Victoria (((146.6196 -38.70196, 146.6721 -38.70259,… 11123
3 Queensland (((148.8473 -20.3457, 148.8722 -20.37575, … 6687
4 South Australia (((137.3481 -34.48242, 137.3749 -34.46885,… 1896
5 Western Australia (((126.3868 -14.01168, 126.3625 -13.98264,… 2522
6 Tasmania (((147.8397 -40.29844, 147.8902 -40.30258,… 1561
7 Northern Territory (((136.3669 -13.84237, 136.3339 -13.83922,… 549
8 Australian Capital Territory (((149.2317 -35.222, 149.2346 -35.24047, 1… 3982
9 Other Territories (((167.9333 -29.05421, 167.9188 -29.0344, … NA
Now we can use these data to create a choropleth map2.
ggplot() +
geom_sf(data = aus_counts,
aes(fill = n)) +
guides(fill = guide_coloursteps(title = "Number of\nObservations")) +
scale_fill_viridis_c(option = "G") +
theme_void()
The sf package also has specialised functions for spatial joins like st_join(), which can be especially useful for joins using points (e.g., POINT) and shapes (e.g., POLYGON, MULTIPOLYGON). Below is a small example where we use the point location to join with the state/territory. Note that we lose the POINT location in favour of the state MULTIPOLYGON shape, held in the column geometry.
# convert record coordinates to sf POINT class
pardalotes_sf <- pardalotes |>
tidyr::drop_na() |>
st_as_sf(coords = c("decimalLongitude", "decimalLatitude"),
crs = 4326)
# join points to aus states that intersect spatially
states_with_species <- st_join(x = aus,
y = pardalotes_sf,
join = st_intersects,
left = FALSE)
states_with_species |>
rmarkdown::paged_table() # paged outputThis shapefile contains state/territory outlines as vectors. See this section on vectors to learn more about what a vector is.↩︎
For more advanced examples of making choropleth maps, check out the ALA Labs articles counting points in shapefiles and using multiple colour scales.↩︎